logo

Introduction

The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise, and avoid long printouts. Feel free to add in as many new code chunks as you’d like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops). Please do not bring in any outside data, and use the provided data as truth (for example, some “home” games have been played at secondary locations, including TOR’s entire 2020-21 season. These are not reflected in the data and you do not need to account for this.) Note that the OKC and DEN 2024-25 schedules in schedule_24_partial.csv intentionally include only 80 games, as the league holds 2 games out for each team in the middle of December due to unknown NBA Cup matchups. Do not assign specific games to fill those two slots.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. We may refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Answers

Part 1

Question 1: 26 4-in-6 stretches in OKC’s draft schedule.

Question 2: 25.1 4-in-6 stretches on average.

Question 3:

  • Most 4-in-6 stretches on average: CHA (28.1)
  • Fewest 4-in-6 stretches on average: NYK (22.2)

Question 4: This is a written question. Please leave your response in the document under Question 4.

Question 5:

  • BKN Defensive eFG%: 54.5%
  • When opponent on a B2B: 53.6%

Part 2

Please show your work in the document, you don’t need anything here.

Part 3

Question 9:

  • Most Helped by Schedule: MIL (260 wins)
  • Most Hurt by Schedule: DET (94 wins)

Setup and Data

library(tidyverse)
# Note, you will likely have to change these paths. If your data is in the same folder as this project, 
# the paths will likely be fixed for you by deleting ../../Data/schedule_project/ from each string.
schedule <- read_csv("~/Downloads/schedule.csv")
draft_schedule <- read_csv("~/Downloads/schedule_24_partial.csv")
locations <- read_csv("~/Downloads/locations.csv")
game_data <- read_csv("~/Downloads/team_game_data.csv")

Part 1 – Schedule Analysis

In this section, you’re going to work to answer questions using NBA scheduling data.

Question 1

QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: clarification, the stretches can overlap, the question is really “How many games are the 4th game played over the past 6 nights?”)

# Here and for all future questions, feel free to add as many code chunks as you like. Do NOT put echo = F though, we'll want to see your code.
library(tidyverse)
draft_schedule_OKC <- draft_schedule %>%
  filter(team == "OKC")

count_4in6<- function(data, date_col){
  data %>%
    # Standarize the data 
    transmute(gamedate = as.Date({{date_col}})) %>% 
    arrange(gamedate) %>%
    mutate(n_games = purrr::map_int(gamedate, ~ sum(gamedate >= .x & gamedate <= .x+5))) %>%
    filter(n_games >= 4) %>%
  mutate(window_end = gamedate + 5) %>%
  select(window_start = gamedate, window_end, n_games)
}
res <- count_4in6(draft_schedule_OKC, gamedate)
res
## # A tibble: 26 × 3
##    window_start window_end n_games
##    <date>       <date>       <int>
##  1 2024-10-30   2024-11-04       4
##  2 2024-11-01   2024-11-06       4
##  3 2024-11-06   2024-11-11       4
##  4 2024-11-08   2024-11-13       4
##  5 2024-11-10   2024-11-15       4
##  6 2024-11-15   2024-11-20       4
##  7 2024-12-26   2024-12-31       4
##  8 2024-12-28   2025-01-02       4
##  9 2024-12-29   2025-01-03       4
## 10 2024-12-31   2025-01-05       4
## # ℹ 16 more rows
nrow(res)
## [1] 26

ANSWER 1:

26 4-in-6 stretches in OKC’s draft schedule.

Question 2

QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.

by_team_season <- schedule %>%
  filter(season >= 2014, season <= 2023) %>%
  group_by(team,season) %>%
  summarise(games_played = n(),
            four_in_six = nrow(count_4in6(pick(gamedate), gamedate)), 
            .groups = "drop"
  ) %>%
  mutate(four_in_six_per82 = four_in_six * 82 / games_played)
overall_avg_q2 <- by_team_season %>% 
  summarise(avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE))
overall_avg_q2
## # A tibble: 1 × 1
##   avg_4in6_per82
##            <dbl>
## 1           25.1

ANSWER 2:

25.1 4-in-6 stretches on average.

Question 3

QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.

by_team_season <- schedule %>%
  filter(season >= 2014, season <= 2023) %>%
  group_by(team,season) %>%
  summarise(games_played = n(),
            four_in_six = nrow(count_4in6(pick(gamedate), gamedate)), 
            .groups = "drop") %>%
  mutate(four_in_six_per82 = four_in_six * 82 / games_played)

#Avg per team across seasons 
team_avgs <- by_team_season %>%
  group_by(team) %>%
  summarise(avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE)) %>%
  arrange(desc(avg_4in6_per82))

#Identify highest and lowest 
highest <- team_avgs %>% slice(1)
lowest  <- team_avgs %>% slice(n()) 

highest 
## # A tibble: 1 × 2
##   team  avg_4in6_per82
##   <chr>          <dbl>
## 1 CHA             28.1
lowest
## # A tibble: 1 × 2
##   team  avg_4in6_per82
##   <chr>          <dbl>
## 1 NYK             22.2

ANSWER 3:

  • Most 4-in-6 stretches on average: CHA (28.1)
  • Fewest 4-in-6 stretches on average: NYK (22.2)

Question 4

QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?

The difference between the most CHA (28.1) and least NYK(22.2) is nearly 6 stretches per 82 games. This difference is likely the result of chance, since it is fairly small compared to the amount of teams in the league and the 10 seasons considered.

ANSWER 4:

Question 5

QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?

bkn_games <- game_data %>%
  mutate(gamedate = as.Date(gamedate)) %>%
  filter(def_team == "BKN" , season == 2023) %>%
  mutate(opp_fgm = fg2made + fg3made,
         opp_fga = fg2attempted + fg3attempted,
         opponent_efg_pct = (opp_fgm + 0.5 * fg3made) / opp_fga)

#Overall defensive eFG%
bkn_def_efg <- mean(bkn_games$opponent_efg_pct, na.rm = TRUE)

#Add B2B info 
opp_b2b <- schedule %>%
  mutate(gamedate = as.Date(gamedate)) %>%
  filter(season == 2023) %>%
  arrange(team, gamedate) %>%
  group_by(team) %>%
  mutate(prev_game = lag(gamedate), b2b = (gamedate - prev_game == 1)) %>%
  ungroup() %>%
  select(team, gamedate, b2b)

bkn_games_b2b <- bkn_games %>%
left_join(opp_b2b %>% select(team, gamedate, b2b), by = c("off_team" = "team", "gamedate" = "gamedate"))

#Defensive eFG% when opponent on B2B
bkn_def_efg_b2b <- mean(bkn_games_b2b$opponent_efg_pct[bkn_games_b2b$b2b == TRUE], na.rm = TRUE)

#Round 
round(bkn_def_efg * 100,1)
## [1] 54.5
round(bkn_def_efg_b2b * 100, 1)
## [1] 53.6

ANSWER 5:

  • BKN Defensive eFG%: 54.5%
  • When opponent on a B2B: 53.6%

Part 3 – Modeling

Question 9

QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 though 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.

If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (eg a printed summary of a regression, a variable importance plot, etc).

sched_feat <- schedule %>%
  mutate(gamedate = as.Date(gamedate),
         away = (home == 0)) %>%
  filter(season >= 2019, season <= 2023) %>%
  arrange(team, gamedate) %>%
  group_by(team, season) %>%
  mutate(
    prev_date = lag(gamedate),
    next_date = lead(gamedate),
    
#B2B
    b2b_any = (!is.na(prev_date) & (gamedate - prev_date == 1)) |
              (!is.na(next_date) & (next_date - gamedate == 1)),

#Count 4in6
n_in_6 = purrr::map_int(gamedate, \(d) sum(gamedate >= d & gamedate <= d + 5)),
four_in_six = (n_in_6 >= 4),

#Road Trips 
prev_away = lag(away, default = FALSE),
new_trip  = away & !prev_away,
trip_id   = ifelse(away,cumsum(new_trip), NA_integer_)
) %>%
group_by(team, season, trip_id) %>%
mutate(road_trip_index = ifelse(away, row_number(), 0L)) %>%
ungroup() %>%
mutate(
  b2b             = as.integer(b2b_any),
  four_in_six     = as.integer(four_in_six),
  road_trip_index = replace_na(road_trip_index, 0L),
  home            = as.integer(home),
  win             = as.integer(win)
) %>%
select(season, team, gamedate, home, away, win, b2b, four_in_six, road_trip_index)

#Fit model 

sched_scored <- sched_feat %>%
  filter(season >= 2019, season <= 2023)

sched_model <- glm(
  win ~ b2b + four_in_six + road_trip_index + home,
  data = sched_scored,
    family = binomial()
) 
summary(sched_model)
## 
## Call:
## glm(formula = win ~ b2b + four_in_six + road_trip_index + home, 
##     family = binomial(), data = sched_scored)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.19060    0.05406  -3.526 0.000422 ***
## b2b             -0.14948    0.03956  -3.778 0.000158 ***
## four_in_six      0.03421    0.04083   0.838 0.402086    
## road_trip_index  0.01202    0.02052   0.586 0.558168    
## home             0.43869    0.05686   7.715 1.21e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16161  on 11657  degrees of freedom
## Residual deviance: 16017  on 11653  degrees of freedom
## AIC: 16027
## 
## Number of Fisher Scoring iterations: 3
sched_scored$pred_win_prob <- predict(sched_model, newdata = sched_scored, type = "response")

sched_scored <- sched_scored %>%
  mutate(expected_wins = ifelse(pred_win_prob > 0.50, 1, 0))

#Actual vs Expected 
  team_summary <- sched_scored %>%
    group_by(team) %>%
    summarise(
      total_actual_wins   = sum(win, na.rm = TRUE),
      total_expected_wins = sum(expected_wins, na.rm = TRUE),
      diff          = total_actual_wins - total_expected_wins,
      .groups       = "drop"
    ) %>%
  arrange(diff)

most_hurt   <- slice_head(team_summary, n = 1)
most_helped <- slice_tail(team_summary, n = 1)

most_helped
## # A tibble: 1 × 4
##   team  total_actual_wins total_expected_wins  diff
##   <chr>             <int>               <dbl> <dbl>
## 1 MIL                 260                 195    65
most_hurt
## # A tibble: 1 × 4
##   team  total_actual_wins total_expected_wins  diff
##   <chr>             <int>               <dbl> <dbl>
## 1 DET                  94                 191   -97

ANSWER 9:

  • Most Helped by Schedule: MIL (260 wins)
  • Most Hurt by Schedule: DET (94 wins)

I fit a logistic regression model on games from 2019 to 2023 with wins as the outcome and schedule features as predictors. The estimates indicate that back to backs are influential in reducing win probability, while home court advantage increases a team’s chances to win. The four in six and road trip indicators are small so they are not significant to the outcome. Overall, the model points to venue and back to back density as the schedule factors with clear impact.